Scalable Message-Logging Techniques for Effective Fault Tolerance in HPC Applications

Author

  • Nitin H. Vaidya
Abstract

An important set of challenges emerges as the High Performance Computing (HPC) community aims to reach extreme scale. Resilience and energy consumption are two of those challenges. Extreme-scale machines are expected to have a high failure frequency. This is an inevitable consequence of the mismatch between two trends: the number of components assembled in supercomputers grows exponentially, while the reliability of each individual component improves much more slowly. At the same time, the vast number of components in a single machine will consume a non-trivial amount of energy. To keep a supercomputer within operational margins, HPC systems have to be both reliable and energy-aware. For an application to run and make progress in spite of constant interruptions, it has to incorporate some form of fault tolerance. Rollback-recovery techniques provide a framework to overcome crashes in the system by periodically saving the state of the application and rolling back to the latest checkpoint in case of failure. Two well-known rollback-recovery techniques are checkpoint/restart and message-logging. The former is easier to implement and has become the de facto standard for making applications fault tolerant. It has, however, a high performance and energy cost during recovery. Message-logging, on the other hand, makes it possible to recover faster from a failure and to consume less energy. The downside of message-logging is the overhead it exhibits in the failure-free scenario: its memory and performance overheads may offset its advantages. This thesis focuses on techniques to alleviate the downsides of message-logging. It presents a mechanism based on high-level programming language constructs to decrease the performance overhead of message-logging. It also introduces two strategies to reduce the memory overhead created by the message log. Additionally, it addresses important architectural constraints of modern supercomputers.
Based on large-scale experimental results and projections from an analytical model, we conclude that message-logging is a promising strategy to provide fault tolerance at a low energy cost for extreme-scale machines.
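
The rollback-recovery idea described in the abstract can be illustrated with a small sketch. This is not the thesis's mechanism; it is a toy model that assumes a single deterministic process whose state depends only on the order of the messages it receives. The names `Process`, `LoggedRun`, and `checkpoint_interval` are hypothetical and chosen only for illustration. The sketch saves a checkpoint every few messages, logs the messages delivered since that checkpoint, and recovers by restoring the checkpoint and replaying the log.

```python
import copy


class Process:
    """Toy deterministic process: state depends only on received messages, in order."""

    def __init__(self):
        self.state = {"total": 0, "received": 0}

    def deliver(self, msg):
        # Order-sensitive update, so replay must preserve the delivery order.
        self.state["total"] = self.state["total"] * 31 + msg
        self.state["received"] += 1


class LoggedRun:
    """Hypothetical wrapper combining periodic checkpoints with a message log."""

    def __init__(self, checkpoint_interval):
        self.proc = Process()
        self.interval = checkpoint_interval
        self.checkpoint = copy.deepcopy(self.proc.state)
        self.log = []  # messages delivered since the last checkpoint

    def send(self, msg):
        # Log the message before delivery so it survives a crash of the receiver.
        self.log.append(msg)
        self.proc.deliver(msg)
        if self.proc.state["received"] % self.interval == 0:
            # A new checkpoint makes the older log entries unnecessary.
            self.checkpoint = copy.deepcopy(self.proc.state)
            self.log.clear()

    def recover(self):
        # Roll back to the last checkpoint, then replay the logged messages.
        self.proc = Process()
        self.proc.state = copy.deepcopy(self.checkpoint)
        for msg in self.log:
            self.proc.deliver(msg)
        return self.proc.state


run = LoggedRun(checkpoint_interval=3)
for m in [1, 2, 3, 4, 5]:
    run.send(m)
state_before_crash = copy.deepcopy(run.proc.state)
run.proc = None  # simulate a crash that loses the in-memory state
assert run.recover() == state_before_crash
```

The log in this sketch grows between checkpoints, which mirrors the memory overhead the abstract names as a downside of message-logging. With plain checkpoint/restart, there would be no log and the work done after the last checkpoint would have to be re-executed from its original inputs.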

Related resources

Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols

With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault tolerant; most are in need of a seamless recovery framework. Among the automatic fault-tolerance techniques proposed for MPI, message logging is preferable for its scalable recovery. The major challenge for message logging protocols i...

On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hi...

A Consensus-Based Fault-Tolerant Event Logger for High Performance Applications

High-performance computing (HPC) systems traditionally employ rollback-recovery techniques to allow fault-tolerant executions of parallel applications. Rollback-recovery based on message logging is an attractive strategy that avoids the drawbacks of coordinated checkpointing in systems with low mean time between failures (MTBF). Most message logging protocols rely on a centralized event logger t...

Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault resilient MPI applications. The mission of the MPI Forum’s Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault tolerant HPC applications. This paper presents an overview of the ...

A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools

The high-performance computing (HPC) community continues to increase the size and complexity of hardware platforms that support advanced scientific workloads. The runtime environment (RTE) is a crucial layer in the software stack for these large-scale systems. The RTE manages the interface between the operating system and the application running in parallel on the machine. The deployment of app...

Publication date: 2013